PMiSLocMF: predicting miRNA subcellular localizations by incorporating multi-source features of miRNAs

Abstract The microRNAs (miRNAs) play crucial roles in several biological processes. It is essential for a deeper insight into their functions and mechanisms by detecting their subcellular localizations. The traditional methods for determining miRNAs subcellular localizations are expensive. The computational methods are alternative ways to quickly predict miRNAs subcellular localizations. Although several computational methods have been proposed in this regard, the incomplete representations of miRNAs in these methods left the room for improvement. In this study, a novel computational method for predicting miRNA subcellular localizations, named PMiSLocMF, was developed. As lots of miRNAs have multiple subcellular localizations, this method was a multi-label classifier. Several properties of miRNA, such as miRNA sequences, miRNA functional similarity, miRNA-disease, miRNA-drug, and miRNA–mRNA associations were adopted for generating informative miRNA features. To this end, powerful algorithms [node2vec and graph attention auto-encoder (GATE)] and one newly designed scheme were adopted to process above properties, producing five feature types. All features were poured into self-attention and fully connected layers to make predictions. The cross-validation results indicated the high performance of PMiSLocMF with accuracy higher than 0.83, average area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPR) exceeding 0.90 and 0.77, respectively. Such performance was better than all previous methods based on the same dataset. Further tests proved that using all feature types can improve the performance of PMiSLocMF, and GATE and self-attention layer can help enhance the performance. Finally, we deeply analyzed the influence of miRNA associations with diseases, drugs, and mRNAs on PMiSLocMF. The dataset and codes are available at https://github.com/Gu20201017/PMiSLocMF.


Introduction
MicroRNAs (miRNAs) are a class of short non-coding RNA molecules [1,2], typically composed of ∼20 nucleotides.They have been extensively studied due to their pivotal roles in cellular regulation [3][4][5][6].The miRNAs are involved in the regulation of various biological processes, such as cell proliferation, differentiation and apoptosis, by binding to the mRNA of target genes [1,[7][8][9].More and more studies show that the abnormal expression of miRNA is closely related to a variety of diseases, including cancers [10,11], neurological diseases [12][13][14], and cardiovascular diseases [15][16][17].The miRNAs inf luence the functional state of cells by regulating the expression of target genes at the post-transcriptional level and thus participate in the pathogenesis of diseases.Furthermore, recent studies reveal the close relationship between miRNAs and drugs [18,19].The miRNAs can be regulated by drugs and further served as potential drug targets.The mechanisms of miRNA function are complex and multifaceted.Temporary 'pausing' of mRNA translation by miRNA-induced silencing complex (RISC) is a major function of miRNAs, which occurs in the cytoplasm and processing bodies [20].In addition, miRNAs have shown potential novel functions in other subcellular compartments.
For example, miRNAs that mature in the cytoplasm can be translocated back to the nucleus.The most important function of these nuclear miRNAs is their ability to bind to complementary sequences on Cis-Regulatory Elements (CREs) (e.g., promoters [21][22][23][24] and enhancers [25][26][27]), thereby regulating transcriptional activation or repression of target genes [28].miRNAs are also present in the nucleolus, where they are able to inf luence ribosomal RNA (rRNA) in the cell, which in turn exert their biological activities [29].When miRNAs enter the mitochondria, they bind to mRNAs encoded by mtDNAs, enabling the regulation of mitochondria-related functions, especially in cancer cells [30].Membrane-derived extracellular vesicles (EVs), such as microvesicles, exosomes, and extracellular vesicle, serve as key mediators of intercellular communication.Certain disease states have also identified a potential role for dysregulated EV-miRNA levels in pathogenesis.These diseases include chronic lung disease, immune response, neuroinf lammation, diabetes, cancer and heart disease [31].In view of the above, miRNAs have specific physiological roles at different cellular sites, and their subcellular localization is essential to gain insight into their physiological functions.Correct identification of miRNA subcellular localizations is helpful to advance the progresses in above research fields.Traditional experimental-based methods, such as classical in situ hybridization and high-throughput RNA sequencing techniques [32][33][34], to detect miRNA subcellular localizations are quite expensive.On the other hand, the huge number of detected miRNAs make them impossible to complete detection in time.Thus, it is urgent to develop quick and reliable methods to predict miRNA subcellular localizations.
In recent years, it is popular to design computational methods for investigating various biological problems.The newly proposed computer algorithms provide strong technique support for designing such methods.However, these methods are generally based on abundant data.Fortunately, more and more properties of miRNAs are uncovered, making it possible to use these properties to predict subcellular localizations.To date, some computational methods have been proposed to identify miRNA subcellular localizations.Most methods only used the miRNA sequence information [35][36][37][38][39]. Xiao et al. adopted a bidirectional long short-term memory (BiLSTM) to encode miRNA sequences and LSTM to decode the localization sets [35].Asim et al. proposed the MirLocPredictor to identify miRNA subcellular localizations, which integrated a novel feature extraction scheme, named kmerPR2vec, to yield features from miRNA sequences [36].Later, they built another method, named L2S-MirLoc, for the prediction of miRNA subcellular localizations [37].The method adopted electronion interaction pseudoPotentials (EIIP) to obtain statistical representations of miRNA sequences and used data transformation approach to transform the multilabel miRNA subcellular localization problem into multi-class problems.Meher et al. designed the method, miRNALoc, for predicting miRNA subcellular localizations [38].Features based on pseudo dinucleotide compositions (PseDNC) and di-nucleotide properties (DiPro) were extracted from miRNA sequences in this method, which were further transformed into principal component scores.Liang et al. designed the model MGFmiRNAloc for the prediction of miRNA subcellular localizations [39].The model proposed simplified molecular input line entry system (SMILES) format to represent miRNA sequences and adopted graphical convolutional network (GCN) to access RNA sequence molecular map features.Some other methods were developed using different miRNA properties, such as gene ontology (GO) of miRNAs [40] and miRNA-mRNA associations [41].Yang et al. first developed a novel scheme to measure the functional similarity for miRNAs based on their GO annotations and used the similarity score to predict miRNA subcellular localizations [40].Xu et al. employed the associated mRNAs of miRNAs and their subcellular localizations to build the network method MiRLoc for detecting miRNA subcellular localizations [41].Recently, Bai et al. proposed the method DAmiRLocGNet, which extracted miRNA features from sequences and miRNA-disease associations [42].Its performance was better than MiRLoc in detecting miRNA subcellular localizations.Although the aforementioned methods shown good performance in determining miRNA subcellular localizations.An evident limitation exists.Most methods only adopted single miRNA properties to set up the method, inducing the incomplete representations of miRNAs.This incomplete representation inf luences the performance of the following prediction.The DAmiRLocGNet adopted sequences and miRNAdisease associations to represent miRNAs.However, it is still not enough to fully encode miRNAs.As mentioned above, miRNAs are highly associated with mRNAs [1,[7][8][9], diseases [10][11][12][13][14][15][16][17], and drugs [18,19].Employment of all this information is helpful to fully represent miRNAs, thereby setting up more powerful computational methods.
In this study, a novel multi-label classifier, named PMiSLocMF (predicting miRNA subcellular localizations with multi-source features), was developed for the prediction of miRNA subcellular localizations.Several properties of miRNAs, such as miRNA sequences, miRNA functional similarity, miRNA-disease, miRNAdrug and miRNA-mRNA (together with subcellular localizations of mRNAs) associations, were fused in the classifier.The powerful algorithms [node2vec [43] and graph attention auto-encoder (GATE) [44]] and a newly designed scheme processed above miRNA properties to yield five informative miRNA feature types.All these features were fed into self-attention and fully connected layers to make predictions.The cross-validation results shown that PMiSLocMF yielded accuracy higher than 0.83, average area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPR) on seven subcellular localizations higher than 0.90 and 0.77, respectively.Such performance was better than all previous methods based on the same dataset.The classifier also had strong generalization ability by testing it on an independent test dataset.Further tests elaborated that all feature types provided positive contributions for predicting miRNA subcellular localizations, and GATE and self-attention layer can help improve the performance.Moreover, the inf luence of miRNA associations with diseases, drugs, and mRNAs on PMiSLocMF was deeply analyzed.

Benchmark dataset
A well-defined dataset is very important for building efficient classifiers.In this study, we obtained the human miRNA subcellular localization dataset S from two previous studies [41,42], which was originally retrieved from RNALocate database (version 2.0) [45].There are 1041 miRNAs in this dataset.These miRNAs are all in the functional similarity network constructed in [41] and are classified into seven subcellular localizations, including cytoplasm, exosome, nucleolus, nucleus, extracellular vesicle, microvesicle, and mitochondrion.Each localization contained more than 50 miRNAs.For formulation, the miRNAs having one subcellular location constitute a subset, denoted by S cytoplasm , S (exosome), S nucleolus , S nucleus , S extracellular vesicle , S microvesicle , S mitochondrion .Then, the benchmark dataset S can be formulated by.
As some miRNAs can have multiple subcellular localizations, the intersection of some above subsets contains common miRNAs.To illustrate this fact, an upset graph was plotted, as shown in Fig. 1.It can be observed that miRNAs having exosome are the most (870) and those having nucleolus are the least (67).Lots of miRNAs have multiple subcellular localizations, where 41 miRNAs have all seven localizations.Thus, when subcellular localizations are deemed as labels and miRNAs are regarded as samples, it is a multi-label classification problem to classify miRNAs into subcellular localizations.
In addition, we constructed an independent test dataset from RNALocate database (version 2.0) [45] to test the generalization ability of the proposed model.All human miRNAs with subcellular localizations in RNALocate database were downloaded.After removing above 1041 miRNAs, 2793 miRNAs were obtained.Then,

Multi-source properties of miRNAs
Recently, it is very popular to employ multiple properties of objects for building efficient classifiers.To date, several miRNA properties have been uncovered and collected in some public databases, making it possible for predicting miRNA subcellular localizations using these properties.This study adopted multiple properties of miRNAs, which are introduced as below.

miRNA-disease association network
In recent years, lots of evidences have been reported that miRNAs have strong associations with diseases.Diseases can be deemed as the functions of miRNAs and have been used to predict miRNA subcellular localizations [42].This study adopted the miRNAdisease associations collected in [42], involving 1041 miRNAs, 640 diseases, and 15 547 miRNA-disease associations.In fact, these associations were extracted from the miRNA-disease associations reported in human miRNA disease database (HMDD, version 3.2) [46].Based on this association information, an association network was built.The 1041 miRNAs and 640 diseases were termed as nodes.One miRNA-disease association determined one edge in this network.For convenience, this network was denoted by N mi−di .In addition, above miRNA-disease associations can be represented by a matrix MD, where rows represented 1041 miRNAs and columns indicated 640 diseases.MD m i , d j = 1 if miRNA m i and disease d j can constitute a miRNA-disease association; otherwise, it was set to zero.Based on this matrix, the Gaussian interaction profile (GIP) kernel similarity between two miRNAs m i and m j was calculated by where

miRNA sequences and similarity network
The sequences of miRNAs are always the first-hand information for investigating miRNAs.They were also used in this study.For all miRNAs mentioned in Section Benchmark dataset, their sequences were sourced from miRBase database [47].The Smith-Waterman algorithm [48] was adopted to measure the similarity of two miRNA sequences.For miRNAs m i and m j , their sequence similarity can be computed by where sp m i , m j stands for the local alignment score of two sequences.For convenience, the similarity score between any two miRNA sequences was downloaded from the previous study [42].
As lots of outcomes of Eq. 3 were zero, we used miRNA GIP kernel similarity for supplement, formulated as Based on the outcomes of Eq. 4, a miRNA sequence similarity network was constructed, denoted by N S .The 1041 miRNAs were defined as nodes, whereas two miRNAs were connected if and only if their sequence similarity score was larger than zero.Furthermore, the similarity score was assigned to the corresponding edge as its weight.

miRNA-drug association network
Like diseases, drugs have also been confirmed to be related to miRNAs.The drugs related to one miRNA may be helpful to predict its subcellular localizations.This information was adopted in this study to construct the classifier.The miRNAdrug associations were retrieved from ncDR [49], a comprehensive chemoinformatics and bioinformatics resource collecting curated and predicted non-coding RNAs associated with drug resistance.After restricting to 1041 miRNAs, 3305 miRNA-drug associations were obtained.These associations involved 130 drugs.Accordingly, another association network containing 1041 miRNAs and 130 drugs as nodes was set up.The miRNA-drug associations determined the edges in this network.Let us denote this network as N mi−dr .

miRNA functional similarity network
In addition to sequence similarity network, we also employed the miRNA functional similarity network.The Wang et al.'s method was adopted to access the functional similarity of miRNAs [50], which was based on the diseases of miRNAs.In medical subject heading (MeSH), all diseases are placed in an acyclic graph (DAG).The edges in this DAG indicates the children or parents of diseases.Each disease di can be represented by a subgraph containing it and all its ancestor nodes.The contribution of a disease dt in this subgraph to the disease di is defined as where is a parameter, which was suggested to set 0.5 [50].Then, the semantic value of disease di is computed by where A(di) is a set containing di and all its ancestor nodes.Then, the semantic similarity of two diseases di and di is defined as For two miRNAs m i and m j , their functional similarity can be measured according to their related to diseases.The diseases related to m i constitute the disease set DD (m i ) and that for m j is denoted as DD m j .The relationship between DD (m i ) and DD m j can indicate the functional similarity of m i and m j , which can be calculated by.
where S di, DD m j = max SS di, di : di ∈ DD m j ) .Likewise, above similarity was fused with miRNA GIP kernel similarity (Eq.2) to reduce the phenomenon that lots of outcomes of Eq. 8 were zero, defined by The miRNA functional similarity network was constructed with the above definition on functional similarity of any two miRNAs.It defined 1041 miRNAs as nodes and the edges in it were determined by a binarization threshold T. If the outcome of Eq. 9 was larger than T, the corresponding miRNAs were adjacent.The obtained miRNA functional similarity network was denoted by N F .

miRNA-mRNA association network and subcellular localizations of mRNAs
mRNAs are a widely studied type of RNAs.Their subcellular localizations have been well-studied.On the other hand, the special associations between miRNAs and mRNAs have been partly revealed in recent years.One previous study confirmed that the subcellular localization data of miRNA target mRNAs are helpful to identify the subcellular localizations of miRNAs [41].Thus, we also adopted the information on miRNA-mRNA associations and subcellular localizations of mRNAs to construct the classifier.
The miRNA-mRNA associations were obtained from Xu et al.'s study [41].This information contained 8254 associations, involving 1041 miRNAs and 2836 mRNAs, which was extracted from the database miRTarBase 2020 [51] and screened by using immunohistochemistry, luciferase reporter assay, northern blot, qRT-PCR or another experimental method.After that, we built the third association network, which defined 1041 miRNAs and 2836 mRNAs as nodes and 8254 associations as edges.This network was denoted by N mi−m .
Besides the miRNA-mRNA association network, we also employed the subcellular localizations of above 2836 mRNAs, which were also used in Xu et al.'s study [41].However, only four subcellular localizations: cytoplasm, exosome, nucleolus, and nucleus, contained more than 50 mRNAs.Other three subcellular localizations (extracellular vesicle, microvesicle, and mitochondrion) contained less than 50 mRNAs, which were removed for data validity.Likewise, some mRNAs have multiple subcellular localizations.Figure 2 shows the intersections of four mRNA subsets, corresponding to four subcellular localizations.mRNAs having exosome are the most (2741) and 877 mRNAs have all four localizations.
In this section, several properties of miRNAs were employed for constructing the classifier, including three association networks (Table 2), miRNA sequence and functional similarity networks.In addition, the subcellular localizations of mRNAs were also fused in the classifier to improve the performance.

Multi-source features of miRNAs
Several properties of miRNAs were introduced in Section Multisource properties of miRNAs.How to perfectly fuse them into the classifier is a challenging problem.Some existing or newly designed methods were adopted to tackle these properties.Informative miRNA features were obtained with the help of these methods.

Features yielded by node2vec
Network is now a popular form to investigate various problems, as the three miRNA association networks, miRNA sequence and functional similarity networks used in this study.To tackle such non-Euclidean data, some powerful methods have been proposed.Among them, network embedding algorithms are an important member, which can assign a numeric vector to each node in one or more networks.Node2vec is one of the most widely used network embedding algorithms [43].Its detailed description is available in Supplementary Material I.In our study, we employed node2vec to extract miRNA features from miRNA sequence similarity network, miRNA-disease, miRNA-drug, and miRNA-mRNA association networks.In detail, from miRNA sequence similarity network N S , a 64-D (dimension) vector was obtained for each miRNA through node2vec; whereas node2vec produced a 128-D vector for each miRNA from miRNA-disease, miRNA-drug, and miRNA-mRNA association networks, respectively.

Features yielded by GATE
In addition to network embedding algorithms, GCN [52] is another representative method to process networks.It can learn a new representation of each node from its raw representation and the given network.Here, a special GCN, GATE [44], was employed to yield high-level miRNA features.GATE contains encoder and decoder procedures.The encoder updates the raw representation of each node by considering the representations of its neighbors.
The decoder tries to recover the raw representations of nodes as perfect as possible.The detailed description of GATE is provided in Supplementary Material I.
In present study, we applied GATE to the raw miRNA features derived from miRNA-disease, miRNA-drug, and miRNA-mRNA association networks through node2vec.The miRNA functional similarity network N F was also fed into GATE.Finally, we obtained three 128-D feature vectors of each miRNA, one is the improved features derived from miRNA-disease association network, the second one is obtained from miRNA-drug association network, and last one contains high-level features derived from miRNA-mRNA association network.

Features derived from miRNA-mRNA association network and subcellular localizations of mRNAs
In Xu et al.'s study [41], they have validated that interacting miRNAs and mRNAs tend to be localized to the same subcellular localizations.Furthermore, they demonstrated that incorporating the subcellular localization information of target mRNAs can help improve the performance of their model.Here, we designed a scheme to extract essential features from miRNA-mRNA associations and subcellular localizations of mRNAs.
Given a miRNA m i , its target mRNAs can be extracted from the miRNA-mRNA association network, i.e., the neighbors of m i in this network.They constitute the mRNA set, denoted by M (m i ).However, directly using the sizes of above four subsets is not a perfect manner.We refine them as follows: Under this operation, outcomes of Eq. 10 are all between 0 and 1, which is more proper to represent miRNA m i .Thus, four features are obtained from miRNA-mRNA association network and subcellular localizations of mRNAs for each miRNA.
With above procedures, each miRNA can be encoded by five feature types.The first feature type was obtained from miRNA sequence similarity network through node2vec.The second feature type was derived from miRNA-disease association network through node2vec and was further processed by GATE and miRNA functional association network.The third and fourth feature types were similar to the second one, which were derived from miRNA-drug and miRNA-mRNA association networks, respectively, rather than miRNA-disease association network.The last feature type was derived from miRNA-mRNA association network.To obtain more informative features, the information of mRNA subcellular localizations was also employed to generate this feature type.For convenience, above five feature types were called miRNA sequence, disease, drug, mRNA network and mRNA co-localization features.The details of these five feature types are listed in Table 3.

Outline of the PMiSLocMF
In this study, we utilized multiple miRNA properties derived from different sources to build the multi-label classifier, named PMiS-LocMF, for the prediction of miRNA subcellular localizations.The entire procedures are illustrated in Fig. 3.
As mentioned in Section Multi-source properties of miRNAs, several miRNA properties, including miRNA sequence and functional similarity networks, miRNA-disease, miRNA-drug, and miRNA-mRNA association networks, were employed.With these properties, five feature types were generated to represent each miRNA.All these feature types were concatenated into one feature vector, which was fed into the prediction procedure.This procedure contained two parts: self-attention layer and fully connected layer.In the self-attention layer, self-attention mechanisms learn the weights between features for better representing the internal structure of the input features, which can help the model capture complex dependencies between features (see Supplementary Material I for detail).The feature vector processed by the self-attention layer were subjected to the fully connected layer, which contained two hidden layers and one output layer.Rectified linear unit (ReLU) activation functions were employed for two hidden layers and the activation function of output layer was sigmoid.The loss function was the binary cross-entropy, which is defined as.
where y i represents the observed label, ŷi stands for the predicted probability.The trainable parameters in self-attention and fully connected layers were optimized by Adam optimizer [53].The multi-label classifier was implemented by TensorFlow 2.11.0 and Scikit-learn [54].

Evaluation metrics
Cross-validation is a commonly used method to assess the performance of classifiers [55].Here, we adopted 10-fold crossvalidation to evaluate the performance of classifiers.The crossvalidation results were counted as several measurements, including aiming, coverage, accuracy, absolute true, and absolute false [56][57][58][59][60][61][62][63][64][65].Besides the overall measurements, we also employed the popular receiver operating characteristic (ROC) and precisionrecall (PR) curves for each label, along with the area under these two curves, denoted by AUC and AUPR.The detailed descriptions of above measurements are provided in Supplementary Material I.

Performance of PMiSLocMF
The parameters of PMiSLocMF were tuned as described in Supplementary Material I and the optimal parameters are provided in Table S1.PMiSLocMF with optimized parameters was evaluated by 10-fold cross-validation.It was found that the aiming, coverage, accuracy, absolute true, and absolute false were 0.9067, 0.9087, 0.8374, 0.5000, and 0.1012, respectively.The aiming and coverage exceeded 0.9 and accuracy researched 0.8, indicating the good performance of PMiSLocMF.
In addition, we also counted the performance of PMiSLocMF on seven subcellular localizations, measured by AUC and AUPR.The ROC and PR curves on seven localizations are shown in Fig. 4. The AUC values for cytoplasm, exosome, nucleolus, nucleus, extracellular vesicle, microvesicle, and mitochondrion were 0.8909, 0.9513, 0.9267, 0.8764, 0.8574, 0.9502, and 0.8702, respectively.These values were all higher than 0.85, suggesting the good performance of PMiSLocMF on each subcellular localization.As for AUPR values, they were 0.8192, 0.9905, 0.5298, 0.8763, 0.4695, 0.9866, and 0.7294.The AUPR values on cytoplasm, exosome, nucleus, microvesicle exceeded 0.8, whereas those on nucleolus and extracellular vesicle were not satisfied, slightly higher or lower than 0.5.By observing Fig. 1, the sizes of nucleolus and extracellular vesicle were smallest.When considering the performance of PMiSLocMF on them, the negative samples were much more than positive samples.AUPR is more sensitive to the imbalanced problem than AUC.All these induced the low AUPR values on these two subcellular localizations.The average AUC and AUPR were further calculated, yielding average AUC of 0.9033 and average AUPR of 0.7716.

Effectiveness of five feature types
The proposed classifier PMiSLocMF used five feature types: miRNA sequence, disease, drug, mRNA network and mRNA colocalization features.Generally, the classifier can provide better performance if more related features are adopted.In this section, some ablation tests on features were conducted to prove the contribution of each feature type to PMiSLocMF and more features can yield better performance.
Five feature types can yield thirty combinations except that used in PMiSLocMF.For each feature type combination, a classifier was built and evaluated by 10-fold cross-validation.The overall performance of these thirty classifiers is listed in Table 4.The performance of PMiSLocMF is also listed in this table for easy comparisons.It was found that PMiSLocMF, which used all five feature types, provided the highest aiming, coverage, accuracy, absolute true, average AUC, and average AUPR, whereas it gave the lowest absolute false.These results indicated that using all feature types made the classifier more powerful.In addition, we also counted the performance of above thirty classifiers on seven subcellular localizations.The AUC and AUPR values are provided in Tables 5 and 6, respectively.The same measurements yielded by PMiSLocMF are also given in these two tables for easy comparisons.It can be observed that PMiSLocMF provided the highest AUC values on four localizations (exosome, nucleus, microvesicle, and mitochondrion) and also yielded the highest AUPR values on four localizations (exosome, nucleus, extracellular vesicle, and microvesicle).Such results suggested that PMiSLocMF was superior to thirty classifiers.As other thirty classifiers lacked one or more feature types, each feature type provided contributions for PMiSLocMF.
With above arguments, each feature type was essential for PMiSLocMF.However, their importance may not be same.From the overall performance of classifiers using single feature type (first five rows in Table 4), it was found that the classifier using miRNA sequence features (represented by α in Table 4) generally provided the best performance, followed by that using miRNA disease features (represented by β in Table 4), miRNA drug features (represented by γ in Table 4), miRNA mRNA network features (represented by δ in Table 4), and miRNA mRNA co-localization features (represented by ε in Table 4).This sequence was also the importance sequence for PMiSLocMF to predict miRNA subcellular localizations from the most to the least.
It was also interesting to investigate whether more features can improve the performance of classifiers.In view of this, we divided above-mentioned thirty classifiers into four groups: classifiers using one, two, three or four feature types.The PMiSLocMF constituted the fifth group as it used all five feature types.The average aiming, coverage, accuracy, absolute true, absolute false, AUC, and AUPR were counted for each classifier group, which are illustrated in Fig. 5.It was found from Fig. 5(A) that all overall measurements, except absolute false, followed a strictly monotone increasing trend with the addition of feature types, whereas absolute false followed a contrary trend.As for AUC and AUPR values on seven subcellular localizations (Fig. 5(B)-(C)), they also followed a strictly monotone increasing trend with the increment of feature types.All these implied that when more feature types were used, the classifier became better and better.
Finally, we analyzed the relationships between five feature types and seven subcellular localizations.Take the performance of PMiSLocMF on seven subcellular localizations as the baseline (last row in Tables 5 and 6), if the removal of one feature type (i.e., the performance of classifiers with four feature types) induced a sharp drop on the performance on one subcellular localization, this feature type was deemed to be highly related to this subcellular localization.Thus, we looked up the test results in Tables 5 and 6, and set the threshold of decline to 0.02 for easy analysis.Related subcellular localizations for each feature type are listed in Table 7.After taking the common subcellular localizations selected by AUC and AUPR, the following highly related subcellular localizations for some feature types were obtained.For miRNA sequence features, its related subcellular localizations were nucleus and microvesicle.For miRNA disease features, nucleus was the only related subcellular localization.For miRNA drug features, its related subcellular localizations were nucleus and cytoplasm.Such relationships between feature types and subcellular localizations may give insights for improving our model.For example, the further refined miRNA disease features may improve the model's performance on nucleus.

Effectiveness of GATE
The GATE was employed to improve the raw miRNA features derived from miRNA-disease, miRNA-drug, and miRNA-mRNA association networks.It was necessary to investigate whether GATE can really improve the performance of PMiSLocMF.Thus, we removed GATE and rebuilt the classifier.This classifier was also evaluated by 10-fold cross-validation.All measurements of this classifier are listed in Table 8.For easy comparisons, the measurements of PMiSLocMF are also provided in this table.It can be observed that aiming, coverage, accuracy, absolute true, average AUC, and average AUPR all decreased, whereas absolute false increased.In detail, the accuracy and absolute true declined by 0.06 and 0.09, respectively, which were significant decrease.As for the AUC and AUPR values on seven subcellular localizations, they also decreased compared with those of PMiSLocMF.These results suggested that the classifier without GATE was inferior to PMiSLocMF, proving the utility of GATE for building the classifier.

Effectiveness of self-attention layer
The self-attention layer was designed to further improve the quality of features before the final prediction.It was also necessary to investigate its effectiveness in building PMiSLocMF.Thus, we  The AUC values on seven subcellular localizations dropped by 0.0005-0.02and AUPR values decreased by the similar degree.Accordingly, we can conclude that the classifier without selfattention layer was inferior to PMiSLocMF, implying the utility of self-attention layer for improving the performance of PMiSLocMF.

Deep analysis of the performance of PMiSLocMF on associated diseases, drugs, and mRNAs
In this study, we used three association networks (Table 2) to yield essential miRNA features.These networks indicated the associations between miRNAs and three objects (diseases, drugs, and mRNAs).Clearly, the numbers of associated diseases, drugs, and mRNAs for miRNAs were not same.It was interesting to investigate whether the number of associated diseases, drugs, and mRNAs can inf luence the performance of PMiSLocMF.In view of this, we first ranked the miRNAs with the decreasing order of the numbers of their associated diseases and equally divided all miRNAs into two groups.The first group contained miRNAs that have many associated diseases, whereas the second group consisted of the rest miRNAs that have few associated diseases.The threshold on the number of associated diseases for two miRNA groups was five.These two groups were called 'strongly' and 'weakly' groups.With the same operations, miRNAs can also be divided into two groups in terms of the numbers of their associated drugs or mRNAs.The thresholds for numbers of associated drugs and mRNAs were two and one, respectively.For the 10-fold cross-validation results of PMiSLocMF, we individually counted five overall measurements (aiming, coverage, accuracy, absolute true, and absolute false) on two groups.These measurements are listed in Table 9.It was amazing that the performance of PMiSLocMF on 'weakly' group was evidently better than that of the 'strongly' group.In general, the miRNAs in 'strongly' group had more associated diseases, drugs, or mRNAs, which provided more informative materials for extracting effective features.Thus, the performance on 'strongly' group should be better than that on 'weakly' group.However, the test results were on the contrary.In view of this, we further counted the number of labels for miRNAs and drew a box plot to show the label number distribution in each miRNA group, as displayed in Fig. 6.It can be found that miRNAs in 'strongly' group generally had more labels, i.e., subcellular localizations, than those in 'weakly' group.In detail, the average label numbers for miRNAs in three 'strongly' groups were 4.06, 3.96, and 4.04, whereas these numbers for miRNAs in three 'weakly' groups were 2.44, 2.55, and 2.46.Furthermore, the span of label numbers for 'strongly' miRNA group was also larger than that of 'weakly' miRNA group.These facts increased the difficulty in correct prediction of subcellular localizations of miRNAs in 'strongly' group.Thus, for the constructed classifier, its prediction on miRNAs with weak associations to other objects was more reliable than those with strong associations.Furthermore, the inf luences of associated diseases, drugs, and mRNAs on PMiSLocMF were not same.According to Table 9, the gap on absolute true in two groups can be computed.It is easy to obtain that the gap for associated diseases was greatest, followed by that for associated drugs and mRNAs.It was indicated that the number of associated diseases provided the most inf luence on PMiSLocMF, the number of associated drugs stood at the second place, and the number of associated mRNAs gave the least effects.

Conclusion
This study proposed a novel multi-label classifier for predicting miRNA subcellular localizations.This classifier fused several miRNA properties and employed or designed feature extraction schemes to generate essential miRNA features from these properties.The cross-validation results indicated the good performance of the classifier and it was superior to all existing methods based on the same dataset.The ablation tests shown that more miRNA properties were helpful to improve the classifier, and GATE and self-attention layer gave key contributions.We also found that the classifier can give correct predictions for miRNAs with weak associations to diseases, drugs, or mRNAs.It is hopeful that the proposed classifier can be a useful tool in determining the subcellular localizations of miRNAs.The dataset and codes are available at https://github.com/Gu20201017/PMiSLocMF.

Figure 1 .
Figure 1.Upset graph to show the intersections of human miRNAs in seven subcellular localizations.Lots of miRNAs have multiple subcellular localizations.

Figure 2 .
Figure 2. Upset graph to show the intersections of human mRNAs in four subcellular localizations.Many mRNAs have multiple subcellular localizations.

Figure 3 .
Figure 3. Entire procedures for constructing PMiSLocMF.Several miRNA properties are employed, which are used to extract miRNA features through node2vec, graph attention auto-encoder, and a newly designed scheme.All obtained miRNA features are concatenated and fed into the self-attention and fully connected layers to make predictions.

Figure 4 .
Figure 4. Performance of PMiSLocMF on seven subcellular localizations measured by ROC and PR curves.(A) ROC curves; (B) PR curves.All AUC values are high (>0.85)and four AUPR values are high (>0.8).

Figure 5 .
Figure 5. Performance of classifiers using one, two, three, four and five feature types.(A) Overall performance; (B) Performance on seven subcellular localizations measured by AUC; (C) Performance on seven subcellular localizations measured by AUPR.When more feature types are used, the classifiers provide better performance.

Table 1 .
Distribution of 41 miRNAs in the independent test dataset on seven subcellular localizations.

Table 2 .
Overview of three association networks for miRNAs in dataset S.

Table 3 .
Breakdown of miRNA features used in PMiSLocMF.
i ) can be classified into four subsets, denoted by M cytoplasm (m i ), M exosome (m i ), M nucleolus (m i ), and M nucleus (m i ), respectively, where M cytoplasm (m i ) contains the mRNAs in M (m i ) that have subcellular localization cytoplasm, and other three subsets are defined in a similar way.If one subset is obviously larger than other three subsets, m i has the corresponding subcellular localization with a high probability.

Table 4 .
Ablation test results on features (overall performance).

Table 5 .
Ablation test results on features (performance on seven subcellular localizations measured by AUC).